For example, CALLHOME: Egyptian Arabic Speech Translation Corpus
Your answer
Subsets
The different subsets in the dataset, if it is broken down by dialect. For example: Algerian, 2000, sentences. Put every subset on a new line in the format subset-name, number of samples, type [tokens, sentences, documents]
Use mixed if the dataset contains multiple dialects
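As a minimal sketch of how an entry in this field could be checked before submission, the snippet below parses one subset per line in the subset-name, number of samples, type format described above. The names `Subset` and `parse_subsets` are hypothetical and are not part of the form; the set of accepted types is taken from the three units listed here.

```python
from typing import List, NamedTuple

# Units named in the form's help text.
VALID_UNITS = {"tokens", "sentences", "documents"}


class Subset(NamedTuple):
    name: str
    volume: int
    unit: str


def parse_subsets(text: str) -> List[Subset]:
    """Parse lines of the form 'subset-name, number of samples, type'."""
    subsets = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split from the right so the last two fields are always volume and unit.
        parts = [part.strip() for part in line.rsplit(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"expected three comma-separated fields: {line!r}")
        name, volume, unit = parts
        if not volume.isdigit():
            raise ValueError(f"number of samples must be a plain integer: {line!r}")
        if unit not in VALID_UNITS:
            raise ValueError(f"type must be one of {sorted(VALID_UNITS)}: {line!r}")
        subsets.append(Subset(name, int(volume), unit))
    return subsets


entries = parse_subsets("Algerian , 2000, sentences\nEgyptian, 1500, tokens")
```

Here `entries[0]` holds the Algerian example from the help text with stray spaces trimmed; a line with an unknown type or a non-numeric count raises `ValueError`.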
Choose
ar-CLS: (Arabic (Classic))
ar-MSA: (Arabic (Modern Standard Arabic))
ar-AE: (Arabic (United Arab Emirates))
ar-BH: (Arabic (Bahrain))
ar-DJ: (Arabic (Djibouti))
ar-DZ: (Arabic (Algeria))
ar-EG: (Arabic (Egypt))
ar-IQ: (Arabic (Iraq))
ar-JO: (Arabic (Jordan))
ar-KM: (Arabic (Comoros))
ar-KW: (Arabic (Kuwait))
ar-LB: (Arabic (Lebanon))
ar-LY: (Arabic (Libya))
ar-MA: (Arabic (Morocco))
ar-MR: (Arabic (Mauritania))
ar-OM: (Arabic (Oman))
ar-PS: (Arabic (Palestine))
ar-QA: (Arabic (Qatar))
ar-SA: (Arabic (Saudi Arabia))
ar-SD: (Arabic (Sudan))
ar-SO: (Arabic (Somalia))
ar-SS: (Arabic (South Sudan))
ar-SY: (Arabic (Syria))
ar-TN: (Arabic (Tunisia))
ar-YE: (Arabic (Yemen))
ar-LEV: (Arabic (Levant))
ar-NOR: (Arabic (North Africa))
ar-GLF: (Arabic (Gulf))
mixed
Domain *
Choose
social media
news articles
reviews
commentary
books
transcribed audio
wikipedia
web pages
other
Form *
Collection Style *
Description *
Brief description of the dataset
Your answer
Volume *
How many samples are in the dataset. This is closely related to the Unit option. For example, if the dataset has 10K tweets, put Volume: 10,000 and Unit: sentences. Please don't use 10K or any other abbreviation.
Your answer
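The Volume rule can be illustrated with a small check, given here only as a sketch: it accepts a fully written number such as 10,000 or 10000 and rejects abbreviations such as 10K. The function name `parse_volume` is hypothetical and not part of the form, and accepting comma thousands separators is an assumption based on the 10,000 example.

```python
import re

# Plain digits, or digits grouped in threes with commas (assumed acceptable
# because the help text itself writes 10,000).
_VOLUME_PATTERN = re.compile(r"\d+|\d{1,3}(,\d{3})+")


def parse_volume(raw: str) -> int:
    """Return the volume as an integer, rejecting abbreviations like '10K'."""
    cleaned = raw.strip()
    if not _VOLUME_PATTERN.fullmatch(cleaned):
        raise ValueError(f"write the full number, without abbreviations: {raw!r}")
    return int(cleaned.replace(",", ""))


volume = parse_volume("10,000")
```

A value such as `"10K"` or `"10 thousand"` raises `ValueError` instead of being silently guessed.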
Unit *
Tokens are usually used for NER, POS tagging, etc.; sentences for sentiment analysis; documents for text modelling tasks
Ethical Risks
Social media datasets are considered mid risk because they might release personal information. Other datasets might also contain hate speech and are therefore considered high risk.
Provider
Name of the institution, e.g. NYU Abu Dhabi
Your answer
Derived From
If the dataset is extracted or collected from another dataset, put the name of that dataset